My IODS project GitHub data repository ↗
Welcome to my IODS 2019 course project page!
This project diary is basically me trying to learn how to do statistical analyses with R. I look forward to improving my R skills and learning some good open data practices.
What do I expect to learn?
Where did I learn about the course?
I think I just browsed the HYMY courses in WebOodi.
The data is from Kimmo Vehkalahti's study ASSIST 2014, which measured the approaches to learning of 183 university students in the Introduction to Social Statistics course in Fall 2014.
## [1] 166 7
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The data has 7 variables:
Composite variables assessing the approaches to learning were formed by combining the items measuring each construct and calculating the item means for each participant. 166 cases were included in the analyses, as 17 cases with a score of 0 for exam points were omitted.
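A minimal sketch of how such composites and the exclusion could be done in R (the item columns d1–d3 are hypothetical stand-ins for the actual ASSIST items):

```r
library(dplyr)

# toy data frame standing in for the raw survey data; d1-d3 are hypothetical items
raw <- data.frame(d1 = c(4, 3, 2), d2 = c(3, 2, 4), d3 = c(4, 3, 3),
                  points = c(25, 0, 22))

# composite score = mean of the items measuring the construct, per participant
raw$deep <- rowMeans(raw[, c("d1", "d2", "d3")])

# exclude cases with zero exam points
analysis_data <- filter(raw, points > 0)
```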
## # A tibble: 2 x 2
## gender n
## <fct> <int>
## 1 F 110
## 2 M 56
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1] 25.51205
## [1] 17 55
There are 110 females and 56 males in the dataset and the ages of the participants range from 17 to 55 years with the average age being 25.5.
p2 <- ggplot(data, aes(attitude)) + geom_histogram(color="white", fill="darkturquoise", binwidth = 2)
p3 <- ggplot(data, aes(deep)) + geom_histogram(color="white", fill="darkturquoise", binwidth = 0.2)
p4 <- ggplot(data, aes(stra)) + geom_histogram(color="white", fill="darkturquoise", binwidth = 0.2)
p5 <- ggplot(data, aes(surf)) + geom_histogram(color="white", fill="darkturquoise", binwidth = 0.2)
p6 <- ggplot(data, aes(points),) + geom_histogram(color="white", fill="darkturquoise", binwidth = 2)
figure <- ggarrange(p2, p3, p4, p5, p6,
labels = c("attitude", "deep", "stra", "surf", "points"),
ncol = 3, nrow = 2)
figure
p7 <- ggpairs(data, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p7
## vars n mean sd median trimmed mad min max range skew
## age 1 166 25.51 7.77 22.00 23.99 2.97 17.00 55.00 38.00 1.89
## attitude 2 166 31.43 7.30 32.00 31.52 7.41 14.00 50.00 36.00 -0.08
## deep 3 166 3.68 0.55 3.67 3.70 0.62 1.58 4.92 3.33 -0.50
## stra 4 166 3.12 0.77 3.19 3.14 0.83 1.25 5.00 3.75 -0.11
## surf 5 166 2.79 0.53 2.83 2.78 0.62 1.58 4.33 2.75 0.14
## points 6 166 22.72 5.89 23.00 23.08 5.93 7.00 33.00 26.00 -0.40
## kurtosis se
## age 3.24 0.60
## attitude -0.48 0.57
## deep 0.66 0.04
## stra -0.45 0.06
## surf -0.27 0.04
## points -0.26 0.46
The other variables seem to be quite symmetrically distributed, with a slight negative skew in deep learning and exam points. By visual inspection, there might be a gender difference in attitude towards statistics, with men having a more positive attitude on average.
Attitude and exam points are moderately correlated. There are also a weak positive and a weak negative correlation between exam points and strategic learning, and between exam points and surface learning, respectively. Interestingly, deep learning is not correlated with exam points. Surface learning is also negatively correlated with deep and strategic learning as well as attitude (r = -.32 … -.16).
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
I chose attitude, strategic learning, and surface learning as predictors in my regression model based on the strength of their correlations with exam points.
The F-statistic tests the omnibus null hypothesis that all regression coefficients are zero. The F-statistic for Model 1 is 14.13 with a p-value less than .001. It is therefore highly unlikely that all coefficients in the model are zero, and we can reject the null hypothesis.
The square of the multiple correlation coefficient (R^2) is .207, which signifies that the variables in the model account for about 21% of the variation in exam points.
However, the non-significant t-values of strategic and surface learning imply that attitude is the only statistically significant predictor in the model (t = 5.91, p < .001). The t-test tests whether a regression coefficient differs from zero.
The unstandardized regression coefficients are reported under “Estimate” in the “Coefficients” table. The coefficient .34 (p < .001) of attitude reflects the strength of the relationship between attitude and exam points on the original scales of the variables, when strategic and surface learning are controlled for. However, we cannot make judgements about the relative importance of the predictors using unstandardized coefficients. We can obtain the standardized values by multiplying the raw regression coefficient by the standard deviation of the explanatory variable and dividing by the standard deviation of the response variable:
attitude: 0.34 × 7.30 / 5.89 = 0.42
stra: 0.85 × 0.77 / 5.89 = 0.11
surf: -0.59 × 0.53 / 5.89 = -0.05
The standardized beta coefficient of attitude on exam points in this model is .42. Strategic (β = .11) and surface learning (β = -.05) did not statistically significantly predict exam points.
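The hand calculations above can also be scripted. This is a sketch assuming a fitted `lm` object and the data frame used to fit it; the function name `std_beta` is my own, not from any package:

```r
# standardized beta = b * sd(x) / sd(y)
std_beta <- function(model, data, outcome) {
  b <- coef(model)[-1]                  # raw slopes, intercept dropped
  sds <- sapply(data[names(b)], sd)     # SDs of the predictors
  b * sds / sd(data[[outcome]])
}

# sanity check: with y an exact multiple of x, the standardized slope is 1
d <- data.frame(x = 1:10, y = 2 * (1:10))
std_beta(lm(y ~ x, data = d), d, "y")
```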
From the “Residuals” table, we can also see how the model residuals are distributed.
##
## Call:
## lm(formula = points ~ attitude, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
In Model 2, the two non-significant predictors, strategic and surface learning, were omitted and exam points were predicted by attitude alone. The null hypothesis is rejected with a statistically significant F-statistic of 38.61, p < .001.
Interpreting the model
The squared multiple correlation coefficient (R^2) in Model 2 is .19, which implies that attitude explains 19% of the variance in exam points. According to the t-test, the regression coefficient of attitude differs from 0 (t = 6.21, p < .001). The unstandardized regression coefficient of attitude on exam points is .35 (p < .001). This implies that when attitude towards statistics increases by 1 on the original scale, exam score increases by 0.35 on average. The standardized regression coefficient is 0.44, which implies that when attitude increases by 1 SD, exam points increase by 0.44 SD. Based on the results, we can say that students with a more positive attitude on average score higher on the course exam, but the effect is relatively weak.
Regression diagnostics
The regression model has several assumptions:
The Residuals vs. Fitted plot shows a linear relationship between the variables. The normality of the variables was examined before the analysis via visuals and descriptive statistics. As there is only one predictor, there cannot be multicollinearity. The standardized residuals also seem to be normally distributed, as implied by the relatively straight line in the Normal Q-Q plot. The Scale-Location plot shows that the residuals are spread equally along the range of the predictor, which implies homoscedasticity. Lastly, the Residuals vs. Leverage plot shows no influential outliers.
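All four diagnostic plots come directly from R's `plot` method for `lm` objects. A sketch, assuming the attitude-only model has been stored as `model2` and the learning data frame is still named `data` as elsewhere on this page:

```r
# refit Model 2 for completeness (assumes the data frame 'data' from above)
model2 <- lm(points ~ attitude, data = data)

# 1 = Residuals vs Fitted, 2 = Normal Q-Q, 3 = Scale-Location, 5 = Residuals vs Leverage
par(mfrow = c(2, 2))
plot(model2, which = c(1, 2, 3, 5))
```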
The data is the Student Performance Data Set from the UCI Machine Learning Repository. It describes student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social, and school-related features, and it was collected using school reports and questionnaires. Two datasets are provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). A full description of the original data can be found here.
The two datasets have been merged using several variables as identifiers to combine individual students' information: school, sex, age, address, family size, parents' cohabitation status, parents' education and job, reason to choose school, attendance to school nursery, and home internet access. If a student had answered the same question on both questionnaires, the rounded average was calculated. If the question was non-numeric, the answer from the Mathematics performance dataset was used.
The R script about creating the merged dataset can be found here.
## [1] X school sex age address famsize
## [7] Pstatus Medu Fedu Mjob Fjob reason
## [13] nursery internet guardian traveltime studytime failures
## [19] schoolsup famsup paid activities higher romantic
## [25] famrel freetime goout Dalc Walc health
## [31] absences G1 G2 G3 alc_use high_use
| Variables | Information |
|---|---|
| sex | student's sex (binary: 'F' - female or 'M' - male) |
| Medu | mother’s education (numeric: 0 none, 1 primary education (4th grade), 2 5th to 9th grade, 3 secondary education or 4 higher education) |
| failures | number of past class failures (numeric: n if 1<=n<3, else 4) |
| absences | number of school absences (numeric: from 0 to 93) |
| high_use | high alcohol consumption (TRUE: the average of self-reported workday and weekend alcohol consumption is greater than 2 on a scale from 1 (very low) to 5 (very high); FALSE: the average is 2 or lower) |
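The last two variables were derived from the workday (Dalc) and weekend (Walc) consumption columns of the original UCI data; a minimal sketch of that computation on a few toy rows:

```r
library(dplyr)

# toy rows with workday (Dalc) and weekend (Walc) consumption on the 1-5 scale
alc_demo <- data.frame(Dalc = c(1, 3, 5), Walc = c(1, 4, 5))

alc_demo <- mutate(alc_demo,
                   alc_use  = (Dalc + Walc) / 2,  # average consumption
                   high_use = alc_use > 2)        # TRUE when the average exceeds 2
```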
Let’s take a glimpse of the structure and dimensions of the subset of data we are using:
## Observations: 382
## Variables: 5
## $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F,...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3,...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3,...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12,...
As Medu is truly a categorical variable, not a numerical one, it will have to be recoded into a factor.
alc$Medu <- factor(alc$Medu)
levels(alc$Medu) <- c("None", "Primary", "5th to 9th", "Secondary", "Higher")
The purpose of my analysis is to study the relationships between high/low alcohol consumption and students' demographic, social, and school-related characteristics.
I chose the following variables to explain the students' alcohol consumption: student's sex, mother's education, class failures, and absences.
| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sex* | 1 | 382 | 1.482 | 0.500 | 1 | 1.477 | 0.000 | 1 | 2 | 1 | 0.073 | -2.000 | 0.026 |
| Medu* | 2 | 382 | 3.806 | 1.086 | 4 | 3.892 | 1.483 | 1 | 5 | 4 | -0.384 | -1.037 | 0.056 |
| failures | 3 | 382 | 0.202 | 0.583 | 0 | 0.033 | 0.000 | 0 | 3 | 3 | 3.034 | 8.689 | 0.030 |
| absences | 4 | 382 | 4.500 | 5.463 | 3 | 3.536 | 2.965 | 0 | 45 | 45 | 3.187 | 16.186 | 0.279 |
| G3 | 5 | 382 | 11.458 | 3.310 | 12 | 11.631 | 2.965 | 0 | 18 | 18 | -0.459 | 0.180 | 0.169 |
p1 <- ggplot(alc, aes(x = sex, fill = sex))
p1 + geom_bar() + theme(legend.position = "none") + ggtitle("Students' sex") + ylab(" ") + xlab("Sex")
| sex | n |
|---|---|
| F | 198 |
| M | 184 |
p2 <- ggplot(alc, aes(x = high_use, fill = sex))
p2 + geom_bar() + theme(legend.position = "none") + facet_wrap(~sex) +
labs(title = "High alcohol consumption of female (left) and male (right) students",
caption = "TRUE = high consumption, FALSE = low consumption") +
ylab(" ") + xlab("High alcohol consumption")
alc %>%
group_by(high_use, sex)%>%
summarise(n=n())%>%
spread(sex, n) %>%
kable(caption = "<b>Table 3</b> High/low alcohol consumption by students' sex crosstabulated") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| high_use | F | M |
|---|---|---|
| FALSE | 156 | 112 |
| TRUE | 42 | 72 |
p3 <- ggplot(alc, aes(x = Medu))
p3 + geom_bar(fill = "deepskyblue2") + theme_classic() + ggtitle("Mother's education") + ylab(" ") + xlab("Education level of mother")
alc %>%
group_by(high_use, Medu)%>%
summarise(n=n())%>%
spread(Medu, n) %>%
kable(caption = "<b>Table 4</b> High/low alcohol consumption by mother's education crosstabulated") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| high_use | None | Primary | 5th to 9th | Secondary | Higher |
|---|---|---|---|---|---|
| FALSE | 1 | 33 | 80 | 59 | 95 |
| TRUE | 2 | 18 | 18 | 36 | 40 |
p3 <- ggplot(alc, aes(x = failures))
p3 + geom_bar(fill = "deepskyblue2") + theme_classic() + ggtitle("Class failures") + ylab(" ") + xlab("Number of class failures")
alc %>%
group_by(high_use, failures)%>%
summarise(n=n())%>%
spread(failures, n) %>%
kable(caption = "<b>Table 5</b> High/low alcohol consumption by class failures crosstabulated") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| high_use | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| FALSE | 244 | 12 | 10 | 2 |
| TRUE | 90 | 12 | 9 | 3 |
p4 <- ggplot(alc, aes(x = absences))
p4 + geom_bar(fill = "deepskyblue2") + theme_classic() + ggtitle("Absences") + ylab(" ") + xlab("Number of absences")
p5 <- ggplot(alc, aes(x = high_use, y = absences, fill = sex))
p5 + geom_boxplot() + ggtitle("Boxplots of absences vs. high/low alcohol use by gender") + ylab("Number of absences") + xlab("High alcohol consumption")
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_absences = mean(absences)) %>% kable(digits = 2, caption = "<b>Table 5</b> Mean number of absences by high/low alcohol consumption and sex") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| sex | high_use | count | mean_absences |
|---|---|---|---|
| F | FALSE | 156 | 4.22 |
| F | TRUE | 42 | 6.79 |
| M | FALSE | 112 | 2.98 |
| M | TRUE | 72 | 6.12 |
p6 <- ggplot(alc, aes(x = G3))
p6 + geom_histogram(color = "white", fill = "deepskyblue2", bins=18) + theme_classic() + ggtitle("Grade") + ylab(" ") + xlab("Grade")
p7 <- ggplot(alc, aes(x = high_use, y = G3, fill = sex))
p7 + geom_boxplot() + ggtitle("Boxplots of grades vs. high/low alcohol use by gender") + ylab("Grade") + xlab("High alcohol consumption")
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3)) %>% kable(digits = 3, caption = "<b>Table 6</b> Mean grade by high/low alcohol consumption and sex") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| sex | high_use | count | mean_grade |
|---|---|---|---|
| F | FALSE | 156 | 11.397 |
| F | TRUE | 42 | 11.714 |
| M | FALSE | 112 | 12.205 |
| M | TRUE | 72 | 10.278 |
# Fitting the logistic regression
mod1 <- glm(high_use ~ Medu + failures + absences + sex, data = alc, family = "binomial")
# Summary of the model
summary(mod1)
##
## Call:
## glm(formula = high_use ~ Medu + failures + absences + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3466 -0.8419 -0.6057 1.0340 2.3030
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.07693 1.26355 0.061 0.9514
## MeduPrimary -1.67979 1.29798 -1.294 0.1956
## Medu5th to 9th -2.65576 1.29394 -2.052 0.0401 *
## MeduSecondary -1.72156 1.28764 -1.337 0.1812
## MeduHigher -2.00750 1.28668 -1.560 0.1187
## failures 0.43321 0.20127 2.152 0.0314 *
## absences 0.09627 0.02427 3.967 7.28e-05 ***
## sexM 0.97945 0.24932 3.928 8.55e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 412.83 on 374 degrees of freedom
## AIC: 428.83
##
## Number of Fisher Scoring iterations: 4
As Medu does not consistently predict high alcohol consumption (the only significant coefficient being the class “5th to 9th”, p = .040), I omitted the variable from the model. This also makes the model easier to interpret.
# Fitting Model 2
mod2 <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
# Summary of Model 2
summary(mod2)
##
## Call:
## glm(formula = high_use ~ failures + absences + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1855 -0.8371 -0.6000 1.1020 2.0209
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.90297 0.22626 -8.411 < 2e-16 ***
## failures 0.45082 0.18992 2.374 0.017611 *
## absences 0.09322 0.02295 4.063 4.85e-05 ***
## sexM 0.94117 0.24200 3.889 0.000101 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 424.40 on 378 degrees of freedom
## AIC: 432.4
##
## Number of Fisher Scoring iterations: 4
# Computing odds ratios and confidence intervals
OR <- coef(mod2) %>% exp
CI <- confint(mod2) %>% exp
## Waiting for profiling to be done...
# Printing out the odds ratios with their confidence intervals
cbind(OR, CI) %>% kable(digits = 3, caption = "<b>Table 7</b> Odds ratios and their confidence intervals") %>% kable_styling(bootstrap_options = c("striped", "hover"))
| OR | 2.5 % | 97.5 % | |
|---|---|---|---|
| (Intercept) | 0.149 | 0.094 | 0.229 |
| failures | 1.570 | 1.083 | 2.295 |
| absences | 1.098 | 1.052 | 1.151 |
| sexM | 2.563 | 1.604 | 4.149 |
As we can see from the model summary, the intercept of high alcohol consumption is -1.90, which is more than 8 standard errors (the z value, or Wald test statistic) away from 0 on the standard normal curve, with a statistically significant p < .001. The slope coefficient of, for example, absences is 0.093. This means that for a one-point increase in absences, the log-odds of high alcohol consumption increases by 0.09. The z values of the coefficients of failures, absences, and sex are positive, over 2 standard errors away from 0, and statistically significant with p < .05 for failures and p < .001 for absences and sex.
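The odds ratios in Table 7 are simply the exponentiated slope coefficients; for example, for absences:

```r
# exponentiating the log-odds slope of absences gives its odds ratio
exp(0.09322)   # each additional absence multiplies the odds of high use by about 1.10
```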
From the odds ratios we can see that, when the effects of the other predictor variables are taken into account…
Conclusion: As expected, class failures, school absences, and student's sex predict higher alcohol consumption. Male students are more likely to be high alcohol consumers. Class failures and absences also increase the probability of higher consumption. However, mother's education doesn't seem to predict alcohol use consistently.
# predict() the probability of high_use
probabilities <- predict(mod1, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(alc, Medu, failures, absences, sex, high_use, probability, prediction) %>% tail(10)## Medu failures absences sex high_use probability prediction
## 373 Primary 1 0 M FALSE 0.45259271 FALSE
## 374 Higher 1 7 M TRUE 0.53891660 TRUE
## 375 5th to 9th 0 1 F FALSE 0.07708996 FALSE
## 376 Higher 0 6 F FALSE 0.20538904 FALSE
## 377 5th to 9th 1 2 F FALSE 0.12421781 FALSE
## 378 Secondary 0 2 F FALSE 0.18968092 FALSE
## 379 Primary 2 2 F FALSE 0.36728009 FALSE
## 380 Primary 0 3 F FALSE 0.21181023 FALSE
## 381 Secondary 0 4 M TRUE 0.43043082 FALSE
## 382 Secondary 0 2 M TRUE 0.38399298 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 257 11
## TRUE 78 36
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67277487 0.02879581 0.70157068
## TRUE 0.20418848 0.09424084 0.29842932
## Sum 0.87696335 0.12303665 1.00000000
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2329843
The average proportion of wrong predictions in the data is 23% using student's sex, absences, and class failures as predictors. This means that the prediction was right about three times out of four, which is clearly better than just guessing (an expected error rate of 50%). The model was especially accurate at predicting low alcohol consumption, predicting it correctly 257 times out of 268. However, the model mispredicted most of the cases where alcohol consumption was categorized as high.
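The training-set error above is typically optimistic; a cross-validated estimate would be more honest. A sketch using `boot::cv.glm` with the same loss function as the cost, assuming `mod2`, `alc`, and `loss_func` are still in the workspace:

```r
library(boot)

# 10-fold cross-validation; the cost function receives (observed class, predicted probability)
cv <- cv.glm(data = alc, glmfit = mod2, cost = loss_func, K = 10)
cv$delta[1]   # cross-validated mean prediction error
```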
Load the Boston data from the MASS package. Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it.
library(MASS); library(tidyverse); library(psych); library(knitr);
library(kableExtra); library(ggplot2); library(corrplot); library(ggpubr)
library(GGally); library(caret)
The Boston dataset (Harrison & Rubinfeld, 1978; Belsley, Kuh, & Welsch, 1980) from the MASS package is about housing values in the suburbs of Boston. Information about the dataset can be found here. The dataset contains the following variables:
| Variables | Information |
|---|---|
| crim | per capita crime rate by town. |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | proportion of non-retail business acres per town. |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). |
| nox | nitrogen oxides concentration (parts per 10 million). |
| rm | average number of rooms per dwelling. |
| age | proportion of owner-occupied units built prior to 1940. |
| dis | weighted mean of distances to five Boston employment centres. |
| rad | index of accessibility to radial highways. |
| tax | full-value property-tax rate per $10,000. |
| ptratio | pupil-teacher ratio by town. |
| black | 1000(Bk − 0.63)^2 where Bk is the proportion of blacks by town. |
| lstat | lower status of the population (percent). |
| medv | median value of owner-occupied homes in $1,000s. |
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
The dataset has 506 observations and 14 variables.
Original source:
Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.
Belsley, D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.
Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.
# exploring descriptive statistics
describe(Boston) %>% kable(caption = "<b>Table 2</b> Descriptive statistics", digits = 2) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| crim | 1 | 506 | 3.61 | 8.60 | 0.26 | 1.68 | 0.33 | 0.01 | 88.98 | 88.97 | 5.19 | 36.60 | 0.38 |
| zn | 2 | 506 | 11.36 | 23.32 | 0.00 | 5.08 | 0.00 | 0.00 | 100.00 | 100.00 | 2.21 | 3.95 | 1.04 |
| indus | 3 | 506 | 11.14 | 6.86 | 9.69 | 10.93 | 9.37 | 0.46 | 27.74 | 27.28 | 0.29 | -1.24 | 0.30 |
| chas | 4 | 506 | 0.07 | 0.25 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 3.39 | 9.48 | 0.01 |
| nox | 5 | 506 | 0.55 | 0.12 | 0.54 | 0.55 | 0.13 | 0.38 | 0.87 | 0.49 | 0.72 | -0.09 | 0.01 |
| rm | 6 | 506 | 6.28 | 0.70 | 6.21 | 6.25 | 0.51 | 3.56 | 8.78 | 5.22 | 0.40 | 1.84 | 0.03 |
| age | 7 | 506 | 68.57 | 28.15 | 77.50 | 71.20 | 28.98 | 2.90 | 100.00 | 97.10 | -0.60 | -0.98 | 1.25 |
| dis | 8 | 506 | 3.80 | 2.11 | 3.21 | 3.54 | 1.91 | 1.13 | 12.13 | 11.00 | 1.01 | 0.46 | 0.09 |
| rad | 9 | 506 | 9.55 | 8.71 | 5.00 | 8.73 | 2.97 | 1.00 | 24.00 | 23.00 | 1.00 | -0.88 | 0.39 |
| tax | 10 | 506 | 408.24 | 168.54 | 330.00 | 400.04 | 108.23 | 187.00 | 711.00 | 524.00 | 0.67 | -1.15 | 7.49 |
| ptratio | 11 | 506 | 18.46 | 2.16 | 19.05 | 18.66 | 1.70 | 12.60 | 22.00 | 9.40 | -0.80 | -0.30 | 0.10 |
| black | 12 | 506 | 356.67 | 91.29 | 391.44 | 383.17 | 8.09 | 0.32 | 396.90 | 396.58 | -2.87 | 7.10 | 4.06 |
| lstat | 13 | 506 | 12.65 | 7.14 | 11.36 | 11.90 | 7.11 | 1.73 | 37.97 | 36.24 | 0.90 | 0.46 | 0.32 |
| medv | 14 | 506 | 22.53 | 9.20 | 21.20 | 21.56 | 5.93 | 5.00 | 50.00 | 45.00 | 1.10 | 1.45 | 0.41 |
# visualizing distributions
plots <- list()  # empty list to collect the histograms
for(i in names(Boston[,-4])) {
plots[[i]] <- ggplot(data=Boston) + geom_histogram(aes_string(x=i),
color="darkslategray", fill="darkturquoise",
bins = 20)
}
plots$chas <- ggplot(Boston, aes(chas)) + geom_bar(color="darkslategray", fill="darkturquoise") +
scale_x_continuous(breaks = 0:1)
fig1 <- ggarrange(plots$crim, plots$zn, plots$indus,
plots$chas, plots$nox, plots$rm,
plots$age, plots$dis, plots$rad,
ncol = 3, nrow = 3)
fig2 <- ggarrange(plots$tax, plots$ptratio, plots$black,
plots$lstat, plots$medv,
ncol = 3, nrow = 3)
fig1
Variables crim and zn are considerably positively skewed with a strong floor effect, whereas black is negatively skewed with a strong ceiling effect. Variables indus, rad, and tax seem to be bimodal or have gaps in their histograms followed by peaks at high values.
# correlation matrix
cor.mat <- cor(Boston)
cor.mat %>% kable(digits = 2, caption = "<b>Table 3</b> Correlations") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| crim | 1.00 | -0.20 | 0.41 | -0.06 | 0.42 | -0.22 | 0.35 | -0.38 | 0.63 | 0.58 | 0.29 | -0.39 | 0.46 | -0.39 |
| zn | -0.20 | 1.00 | -0.53 | -0.04 | -0.52 | 0.31 | -0.57 | 0.66 | -0.31 | -0.31 | -0.39 | 0.18 | -0.41 | 0.36 |
| indus | 0.41 | -0.53 | 1.00 | 0.06 | 0.76 | -0.39 | 0.64 | -0.71 | 0.60 | 0.72 | 0.38 | -0.36 | 0.60 | -0.48 |
| chas | -0.06 | -0.04 | 0.06 | 1.00 | 0.09 | 0.09 | 0.09 | -0.10 | -0.01 | -0.04 | -0.12 | 0.05 | -0.05 | 0.18 |
| nox | 0.42 | -0.52 | 0.76 | 0.09 | 1.00 | -0.30 | 0.73 | -0.77 | 0.61 | 0.67 | 0.19 | -0.38 | 0.59 | -0.43 |
| rm | -0.22 | 0.31 | -0.39 | 0.09 | -0.30 | 1.00 | -0.24 | 0.21 | -0.21 | -0.29 | -0.36 | 0.13 | -0.61 | 0.70 |
| age | 0.35 | -0.57 | 0.64 | 0.09 | 0.73 | -0.24 | 1.00 | -0.75 | 0.46 | 0.51 | 0.26 | -0.27 | 0.60 | -0.38 |
| dis | -0.38 | 0.66 | -0.71 | -0.10 | -0.77 | 0.21 | -0.75 | 1.00 | -0.49 | -0.53 | -0.23 | 0.29 | -0.50 | 0.25 |
| rad | 0.63 | -0.31 | 0.60 | -0.01 | 0.61 | -0.21 | 0.46 | -0.49 | 1.00 | 0.91 | 0.46 | -0.44 | 0.49 | -0.38 |
| tax | 0.58 | -0.31 | 0.72 | -0.04 | 0.67 | -0.29 | 0.51 | -0.53 | 0.91 | 1.00 | 0.46 | -0.44 | 0.54 | -0.47 |
| ptratio | 0.29 | -0.39 | 0.38 | -0.12 | 0.19 | -0.36 | 0.26 | -0.23 | 0.46 | 0.46 | 1.00 | -0.18 | 0.37 | -0.51 |
| black | -0.39 | 0.18 | -0.36 | 0.05 | -0.38 | 0.13 | -0.27 | 0.29 | -0.44 | -0.44 | -0.18 | 1.00 | -0.37 | 0.33 |
| lstat | 0.46 | -0.41 | 0.60 | -0.05 | 0.59 | -0.61 | 0.60 | -0.50 | 0.49 | 0.54 | 0.37 | -0.37 | 1.00 | -0.74 |
| medv | -0.39 | 0.36 | -0.48 | 0.18 | -0.43 | 0.70 | -0.38 | 0.25 | -0.38 | -0.47 | -0.51 | 0.33 | -0.74 | 1.00 |
Correlogram showing correlations with p < .05
# p-value matrix
cor.mtest <- function(cor.mat, ...) {
mat <- as.matrix(cor.mat)
n <- ncol(cor.mat)
p.mat<- matrix(NA, n, n)
diag(p.mat) <- 0
for (i in 1:(n - 1)) {
for (j in (i + 1):n) {
tmp <- cor.test(mat[, i], mat[, j], ...)
p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
}
}
colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
p.mat
}
p.mat <- cor.mtest(Boston)
# correlogram
corrplot.mixed(cor.mat, lower = "circle", upper = "number", tl.col = "black",
tl.cex = 0.7, number.cex = 0.8, number.digits = 2,
p.mat = p.mat, sig.level = 0.05, insig = "blank")
The correlation matrix shows that:
The crime rate of the suburb correlates moderately with accessibility to radial highways (r = .63) and property tax rate (r = .58).
Crime also shows low-to-moderate positive correlations (.30 < r < .50) with
…and low-to-moderate negative correlations (-.50 < r < -.30) with
Some other notable relationships:
Standardize the dataset and print out summaries of the scaled data.
# standardizing the dataset and converting the new object to a data frame
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
How did the variables change?
All the variables have a mean of 0 and a standard deviation of 1.
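This claim can be verified numerically; a quick check using the Boston data from MASS:

```r
library(MASS)

boston_check <- as.data.frame(scale(Boston))
# after scaling, every column should have mean 0 and standard deviation 1
stopifnot(all(abs(colMeans(boston_check)) < 1e-10),
          all(abs(apply(boston_check, 2, sd) - 1) < 1e-10))
```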
Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset.
# quantile vector
bins <- quantile(boston_scaled$crim)
# creating a categorical variable
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# removing original variable and adding the new categorical variable
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
Divide the dataset into train and test sets, so that 80% of the data belongs to the train set.
# n of rows
n <- nrow(boston_scaled)
# random 80% of rows
ind <- sample(n, size = n * 0.8)
# creating train and test sets
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
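Note that `sample()` draws a different split on every knit, so the LDA results below change each time the document is rendered; calling `set.seed()` first would make the split reproducible. A minimal sketch on toy data (the seed value 123 is just an example):

```r
set.seed(123)  # any fixed seed makes the split reproducible

# toy data frame standing in for boston_scaled
df <- data.frame(x = 1:100)

# random 80% of row indices for the train set, the rest for the test set
n <- nrow(df)
ind <- sample(n, size = n * 0.8)
train <- df[ind, , drop = FALSE]
test  <- df[-ind, , drop = FALSE]

c(nrow(train), nrow(test))  # 80 and 20 rows
```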
Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2524752 0.2351485 0.2623762 0.2500000
##
## Group means:
## zn indus chas nox rm
## low 0.94094679 -0.8959715 -0.11793298 -0.8727845 0.44333257
## med_low -0.07155763 -0.2842787 -0.02367011 -0.5833766 -0.09913712
## med_high -0.38126085 0.1769416 0.09909548 0.3808769 0.13833395
## high -0.48724019 1.0171306 -0.07742312 1.0389754 -0.42839934
## age dis rad tax ptratio black
## low -0.8697248 0.8648358 -0.6891247 -0.7368011 -0.4042325 0.3841720
## med_low -0.3182762 0.3639280 -0.5406181 -0.4141852 -0.0144683 0.3544590
## med_high 0.3790362 -0.3999683 -0.3859685 -0.3034513 -0.3472420 0.1173889
## high 0.8022887 -0.8555972 1.6379981 1.5139626 0.7806252 -0.7744308
## lstat medv
## low -0.78741307 0.51044814
## med_low -0.13017526 0.01680552
## med_high -0.01799942 0.22845843
## high 0.86723470 -0.64625749
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.09265080 0.690692365 -0.938339866
## indus -0.01258406 -0.286296112 0.078936464
## chas -0.06149673 -0.064245244 0.094953568
## nox 0.38947298 -0.749187820 -1.237607845
## rm -0.10239619 -0.063181998 -0.092706113
## age 0.28121976 -0.232175332 0.005510413
## dis -0.08586298 -0.203661919 0.292153128
## rad 2.99895536 0.789247536 -0.502213317
## tax -0.00768637 0.132269951 1.033538836
## ptratio 0.10957092 0.081357286 -0.159242193
## black -0.15429920 0.008970574 0.152074437
## lstat 0.21324366 -0.198330351 0.496024922
## medv 0.17703583 -0.385968260 -0.044342221
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9457 0.0401 0.0142
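The “proportion of trace” printed above is the share of between-group variance captured by each discriminant; it can be computed from the singular values stored in the fitted object as `lda.fit$svd^2 / sum(lda.fit$svd^2)`. A self-contained sketch on the built-in iris data:

```r
library(MASS)

# fit an LDA with three classes, giving two discriminant functions
fit <- lda(Species ~ ., data = iris)

# proportion of trace = squared singular values, normalized to sum to 1
prop <- fit$svd^2 / sum(fit$svd^2)
round(prop, 4)  # matches the "Proportion of trace" line of print(fit)
```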
lda.fit$means %>% kable(digits = 2, caption = "<b>Table 4</b>") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
|  | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| low | 0.94 | -0.90 | -0.12 | -0.87 | 0.44 | -0.87 | 0.86 | -0.69 | -0.74 | -0.40 | 0.38 | -0.79 | 0.51 |
| med_low | -0.07 | -0.28 | -0.02 | -0.58 | -0.10 | -0.32 | 0.36 | -0.54 | -0.41 | -0.01 | 0.35 | -0.13 | 0.02 |
| med_high | -0.38 | 0.18 | 0.10 | 0.38 | 0.14 | 0.38 | -0.40 | -0.39 | -0.30 | -0.35 | 0.12 | -0.02 | 0.23 |
| high | -0.49 | 1.02 | -0.08 | 1.04 | -0.43 | 0.80 | -0.86 | 1.64 | 1.51 | 0.78 | -0.77 | 0.87 | -0.65 |
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plotting the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
Save the crime categories from the test set and then remove the categorical crime variable from the test dataset.
# saving crime categories
correct_classes <- test$crime
# removing the variable from test data
test <- dplyr::select(test, -crime)
Then predict the classes with the LDA model on the test data. Cross-tabulate the results with the crime categories from the test set. Comment on the results.
# predicting the classes with the model
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class) %>%
  kable(caption = "<b>Table 5</b> Correct and predicted crime classes cross-tabulated") %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  add_header_above(c("Correct crime class" = 1, "Predicted crime class" = 4))
|  | low | med_low | med_high | high |
|---|---|---|---|---|
| low | 15 | 10 | 0 | 0 |
| med_low | 8 | 17 | 6 | 0 |
| med_high | 0 | 10 | 10 | 0 |
| high | 0 | 0 | 0 | 26 |
## Confusion Matrix and Statistics
##
## Reference
## Prediction low med_low med_high high
## low 15 8 0 0
## med_low 10 17 10 0
## med_high 0 6 10 0
## high 0 0 0 26
##
## Overall Statistics
##
## Accuracy : 0.6667
## 95% CI : (0.5664, 0.7569)
## No Information Rate : 0.3039
## P-Value [Acc > NIR] : 5.016e-14
##
## Kappa : 0.5488
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: low Class: med_low Class: med_high Class: high
## Sensitivity 0.6000 0.5484 0.50000 1.0000
## Specificity 0.8961 0.7183 0.92683 1.0000
## Pos Pred Value 0.6522 0.4595 0.62500 1.0000
## Neg Pred Value 0.8734 0.7846 0.88372 1.0000
## Prevalence 0.2451 0.3039 0.19608 0.2549
## Detection Rate 0.1471 0.1667 0.09804 0.2549
## Detection Prevalence 0.2255 0.3627 0.15686 0.2549
## Balanced Accuracy 0.7481 0.6333 0.71341 1.0000
Overall, the prediction accuracy of the model is about 67 % (the exact figure changes when the code is rerun, because the test data is sampled randomly every time), which is quite high for a four-class problem. The predictions are clearly most accurate for the high crime rate class, which is classified perfectly here.
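The accuracy figure can be read straight off the cross-tabulation in Table 5: correct predictions sit on the diagonal, so accuracy is the diagonal sum divided by the total. With the counts shown above:

```r
# confusion matrix from Table 5 (rows = correct class, cols = predicted class)
tab <- matrix(c(15, 10,  0,  0,
                 8, 17,  6,  0,
                 0, 10, 10,  0,
                 0,  0,  0, 26),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))

# accuracy = correctly classified / all cases
accuracy <- sum(diag(tab)) / sum(tab)
round(accuracy, 4)  # 0.6667, as reported by confusionMatrix()
```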
Reload the Boston dataset and standardize the dataset.
Calculate the distances between the observations.
# euclidean distance matrix
dist_eu <- dist(Bos)
#manhattan distance matrix
dist_man <- dist(Bos, method = "manhattan")
# summaries
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
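To make the two metrics concrete, a toy example with two points: Euclidean distance is the straight-line length, while Manhattan distance is the sum of the absolute coordinate differences, which is why the Manhattan summary above runs larger.

```r
# two points: a = (0, 0) and b = (3, 4)
m <- rbind(a = c(0, 0), b = c(3, 4))

dist(m)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(m, method = "manhattan")  # |3| + |4| = 7
```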
Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results.
# k-means clustering
km <- kmeans(Bos, centers = 4)
# determining k
set.seed(500)
k_max <- 10
# the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Bos, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line') + scale_x_continuous(breaks = 1:10)
# k-means clustering with the chosen number of clusters
km <- kmeans(Bos, centers = 2)
# comparing clusters
ggpairs(Bos[1:5], aes(col = factor(km$cluster), alpha = 0.3),
        lower = list(combo = wrap("facethist", bins = 20)))
## Warning in cor(x, y, method = method, use = use): the standard deviation is
## zero
ggpairs(Bos[6:10], aes(col = factor(km$cluster), alpha = 0.3),
        lower = list(combo = wrap("facethist", bins = 20)))
ggpairs(Bos[11:14], aes(col = factor(km$cluster), alpha = 0.3),
        lower = list(combo = wrap("facethist", bins = 20)))
The most radical change (the “elbow”) in the total within-cluster sum of squares (WCSS) happens between one and two clusters. Although WCSS keeps decreasing as clusters are added, the changes become increasingly subtle. Therefore, I chose the two-cluster solution.
The clusters differ most clearly in crime rate, with cluster 2 showing virtually no crime. The clusters also differ in the proportion of non-retail business, nitrogen oxide concentration, proportion of old buildings, distance to employment centers, accessibility of radial highways, property tax rate, and pupil-teacher ratio.
Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them.
The original data is from the United Nations’ Human Development Reports. The script in which the data was wrangled can be found here.
The data combines several indicators from most countries in the world.
| Variables | Information |
|---|---|
| edu2FM | Proportion of females with at least secondary education divided by the proportion of males with at least secondary education |
| labFM | Proportion of females in the labour force divided by the proportion of males in the labour force |
| edu.exp | Expected years of schooling |
| life.exp | Life expectancy at birth |
| gni | Gross National Income per capita |
| mat.mor | Maternal mortality ratio |
| ado.birth | Adolescent birth rate |
| parlF | Proportion of female representatives in parliament |
## Observations: 155
## Variables: 8
## $ edu2FM <dbl> 1.0072389, 0.9968288, 0.9834369, 0.9886128, 0.969060...
## $ labFM <dbl> 0.8908297, 0.8189415, 0.8251001, 0.8840361, 0.828611...
## $ edu.exp <dbl> 17.5, 20.2, 15.8, 18.7, 17.9, 16.5, 18.6, 16.5, 15.9...
## $ life.exp <dbl> 81.6, 82.4, 83.0, 80.2, 81.6, 80.9, 80.9, 79.1, 82.0...
## $ gni <int> 64992, 42261, 56431, 44025, 45435, 43919, 39568, 529...
## $ mat.mor <int> 4, 6, 6, 5, 6, 7, 9, 28, 11, 8, 6, 4, 8, 4, 27, 2, 1...
## $ ado.birth <dbl> 7.8, 12.1, 1.9, 5.1, 6.2, 3.8, 8.2, 31.0, 14.5, 25.3...
## $ parlF <dbl> 39.6, 30.5, 28.5, 38.0, 36.9, 36.9, 19.9, 19.4, 28.2...
## edu2FM labFM edu.exp life.exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## gni mat.mor ado.birth parlF
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
describe(data) %>% kable(digits = 2, caption = "<b>Table 2</b> Descriptives") %>%
  kable_styling(bootstrap_options = c("hover", "striped"))
|  | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| edu2FM | 1 | 155 | 0.85 | 0.24 | 0.94 | 0.87 | 0.12 | 0.17 | 1.50 | 1.33 | -0.76 | 0.55 | 0.02 |
| labFM | 2 | 155 | 0.71 | 0.20 | 0.75 | 0.73 | 0.17 | 0.19 | 1.04 | 0.85 | -0.87 | 0.05 | 0.02 |
| edu.exp | 3 | 155 | 13.18 | 2.84 | 13.50 | 13.24 | 2.97 | 5.40 | 20.20 | 14.80 | -0.20 | -0.34 | 0.23 |
| life.exp | 4 | 155 | 71.65 | 8.33 | 74.20 | 72.40 | 7.56 | 49.00 | 83.50 | 34.50 | -0.76 | -0.15 | 0.67 |
| gni | 5 | 155 | 17627.90 | 18543.85 | 12040.00 | 14552.58 | 13337.47 | 581.00 | 123124.00 | 122543.00 | 2.14 | 6.83 | 1489.48 |
| mat.mor | 6 | 155 | 149.08 | 211.79 | 49.00 | 104.70 | 63.75 | 1.00 | 1100.00 | 1099.00 | 2.03 | 4.16 | 17.01 |
| ado.birth | 7 | 155 | 47.16 | 41.11 | 33.60 | 41.62 | 35.73 | 0.60 | 204.80 | 204.20 | 1.13 | 0.89 | 3.30 |
| parlF | 8 | 155 | 20.91 | 11.49 | 19.30 | 20.32 | 11.42 | 0.00 | 57.50 | 57.50 | 0.55 | -0.10 | 0.92 |
data %>%
gather(factor_key = T) %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(color="grey40", fill="deeppink", bins = 30)
Perform principal component analysis (PCA) on the not standardized human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
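The reason PC1 swallows essentially all the variance here is scale: `prcomp()` without standardization works on covariances, and `gni` (standard deviation in the tens of thousands) dwarfs ratio variables such as `edu2FM`. A toy sketch of the effect, using simulated columns that merely mimic the `gni` and `edu2FM` scales from Table 2:

```r
set.seed(2019)
toy <- data.frame(gni    = rnorm(155, 17628, 18544),  # huge spread
                  edu2FM = rnorm(155, 0.85, 0.24))    # tiny spread

# unscaled PCA: PC1 is essentially just gni
p_raw <- prcomp(toy)
summary(p_raw)$importance["Proportion of Variance", "PC1"]  # ~1

# scaled PCA: both variables contribute
p_std <- prcomp(toy, scale. = TRUE)
summary(p_std)$importance["Proportion of Variance", "PC1"]  # ~0.5
```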
Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analyses (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots, describing the results using not just the variable names but the actual phenomena they relate to.
## edu2FM labFM edu.exp life.exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## gni mat.mor ado.birth parlF
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
pca2 <- prcomp(data_std)
biplot(pca2, choices = 1:2, cex = c(0.8, 1.2), col = c("grey40", "deeppink2"))
Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data.
Load the tea dataset from the package FactoMineR. Explore the data briefly: look at the structure and the dimensions of the data and visualize it. Then do Multiple Correspondence Analysis on the tea data (or on certain columns of the data, it’s up to you). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots.
## Observations: 300
## Variables: 36
## $ breakfast <fct> breakfast, breakfast, Not.breakfast, Not.brea...
## $ tea.time <fct> Not.tea time, Not.tea time, tea time, Not.tea...
## $ evening <fct> Not.evening, Not.evening, evening, Not.evenin...
## $ lunch <fct> Not.lunch, Not.lunch, Not.lunch, Not.lunch, N...
## $ dinner <fct> Not.dinner, Not.dinner, dinner, dinner, Not.d...
## $ always <fct> Not.always, Not.always, Not.always, Not.alway...
## $ home <fct> home, home, home, home, home, home, home, hom...
## $ work <fct> Not.work, Not.work, work, Not.work, Not.work,...
## $ tearoom <fct> Not.tearoom, Not.tearoom, Not.tearoom, Not.te...
## $ friends <fct> Not.friends, Not.friends, friends, Not.friend...
## $ resto <fct> Not.resto, Not.resto, resto, Not.resto, Not.r...
## $ pub <fct> Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, ...
## $ Tea <fct> black, black, Earl Grey, Earl Grey, Earl Grey...
## $ How <fct> alone, milk, alone, alone, alone, alone, alon...
## $ sugar <fct> sugar, No.sugar, No.sugar, sugar, No.sugar, N...
## $ how <fct> tea bag, tea bag, tea bag, tea bag, tea bag, ...
## $ where <fct> chain store, chain store, chain store, chain ...
## $ price <fct> p_unknown, p_variable, p_variable, p_variable...
## $ age <int> 39, 45, 47, 23, 48, 21, 37, 36, 40, 37, 32, 3...
## $ sex <fct> M, F, F, M, M, M, M, F, M, M, M, M, M, M, M, ...
## $ SPC <fct> middle, middle, other worker, student, employ...
## $ Sport <fct> sportsman, sportsman, sportsman, Not.sportsma...
## $ age_Q <fct> 35-44, 45-59, 45-59, 15-24, 45-59, 15-24, 35-...
## $ frequency <fct> 1/day, 1/day, +2/day, 1/day, +2/day, 1/day, 3...
## $ escape.exoticism <fct> Not.escape-exoticism, escape-exoticism, Not.e...
## $ spirituality <fct> Not.spirituality, Not.spirituality, Not.spiri...
## $ healthy <fct> healthy, healthy, healthy, healthy, Not.healt...
## $ diuretic <fct> Not.diuretic, diuretic, diuretic, Not.diureti...
## $ friendliness <fct> Not.friendliness, Not.friendliness, friendlin...
## $ iron.absorption <fct> Not.iron absorption, Not.iron absorption, Not...
## $ feminine <fct> Not.feminine, Not.feminine, Not.feminine, Not...
## $ sophisticated <fct> Not.sophisticated, Not.sophisticated, Not.sop...
## $ slimming <fct> No.slimming, No.slimming, No.slimming, No.sli...
## $ exciting <fct> No.exciting, exciting, No.exciting, No.exciti...
## $ relaxing <fct> No.relaxing, No.relaxing, relaxing, relaxing,...
## $ effect.on.health <fct> No.effect on health, No.effect on health, No....
## Warning: attributes are not identical across measure variables;
## they will be dropped
# mca <- MCA(tea, graph = FALSE)
# summary(mca)
# plot(mca, invisible = c("ind"), habillage = "quali")